[Bug] Fix access to GPU by adding gpus all on warmup and launch warmup #36
SangaraSorama wants to merge 2 commits into
GPU Detection Fix

When running training with Elephant on the dataset, the process may default to the CPU. To ensure the training runs on the GPU, the following changes were made to the Makefile in the Docker/elephant-server directory.

1. Force GPU usage in the launch target

In Docker/elephant-server/Makefile, within the launch target:

Before:

    $(ELEPHANT_DOCKER) run -it --rm $(GPU_ARG) --shm-size=8g -v $(ELEPHANT_WORKSPACE):/workspace \

After:

    $(ELEPHANT_DOCKER) run -it --rm $(GPU_ARG) --gpus all --shm-size=8g -v $(ELEPHANT_WORKSPACE):/workspace \

This change ensures that the container explicitly requests access to all GPUs.

2. Improve GPU detection logic in the warmup target

In Docker/elephant-server/Makefile, the warmup rule was updated to define a fallback behavior when ELEPHANT_GPU is not set.

Before:

    warmup:
        $(eval GPU_ARG:=$(shell \
            if [ -n "$(ELEPHANT_NVIDIA_GID)" ] && [ -n "$(ELEPHANT_GPU)" ]; then \
                VAR=$$(echo --gpus '"device=$(ELEPHANT_GPU)"'); \
            fi;\
            echo $$VAR))
        @if [ -n "$(GPU_ARG)" ]; then \
            $(ELEPHANT_DOCKER) run -it --rm $(GPU_ARG) $(ELEPHANT_IMAGE_NAME) echo "warming up GPU..."; \
        else \
            echo "CPU mode..."; \
        fi

After:

    warmup:
        $(eval GPU_ARG:=$(shell \
            if [ -n "$(ELEPHANT_NVIDIA_GID)" ] && [ -n "$(ELEPHANT_GPU)" ]; then \
                VAR="--gpus device=$(ELEPHANT_GPU)"; \
            else \
                VAR="--gpus all"; \
            fi; \
            echo $$VAR))

This ensures that when a specific GPU is not defined, the Docker container still uses all available GPUs instead of falling back to CPU execution.

Note: The updated Makefile is provided as an attachment in .tar.gz format.

Contributors: Jan Amouroux, Célia Brahimi, and Camille-Astrid Rodrigues
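The fallback in the updated warmup rule can be tried outside of make. The following is a minimal sketch, not the Makefile's actual code: `gpu_arg` is a hypothetical helper name, and its two positional parameters stand in for the values of ELEPHANT_NVIDIA_GID and ELEPHANT_GPU.

```shell
#!/bin/sh
# Hypothetical helper mirroring the selection logic of the updated warmup rule.
# $1: value of ELEPHANT_NVIDIA_GID (may be empty)
# $2: value of ELEPHANT_GPU (may be empty)
gpu_arg() {
    if [ -n "$1" ] && [ -n "$2" ]; then
        # Both values are set: request the specific device.
        printf '%s\n' "--gpus device=$2"
    else
        # Either value is missing: fall back to all GPUs.
        printf '%s\n' "--gpus all"
    fi
}

gpu_arg "1000" "0"   # prints: --gpus device=0
gpu_arg "1000" ""    # prints: --gpus all
gpu_arg "" ""        # prints: --gpus all
```

With this logic the argument is never empty, so the "CPU mode..." branch of the warmup rule is no longer reachable through GPU_ARG. Like the updated rule, the sketch emits the unquoted `device=` form; for a comma-separated device list, Docker's documentation uses the quoted form `--gpus '"device=0,1"'`, so the unquoted form may only be suitable for a single device.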
What are the changes the user will see?
The GPU access should not fail anymore, regardless of the GPU type the user has.
Why am I making these changes?
Currently, many GPUs are not detected when ELEPHANT tries to access them, even once the container itself is running.
Example of a reported issue: https://forum.image.sc/t/issues-with-gpu-on-local-elephant-server/91424
This pull request should fix this issue.
What are the changes from a developer perspective?
Changes to the Makefile:
- --gpus all is added in launch: warmup
- warmup now defaults back to --gpus all if no GPU is originally found (that is, when ELEPHANT_NVIDIA_GID or ELEPHANT_GPU is empty)
How to test the changes?
Try to install ELEPHANT in 2 contexts:
In both cases, the GPU should be accessed properly.
We only tested on a computer whose GPU was not accessible before the fix (NVIDIA GeForce RTX 4070 Laptop GPU).
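One possible way to check GPU visibility by hand is to run `nvidia-smi -L` inside the container with `--gpus all`. The sketch below rests on assumptions: `check_gpu` is a hypothetical helper, the `DOCKER` variable is an illustrative stand-in for the Makefile's $(ELEPHANT_DOCKER), the image name must be supplied by the caller (the value of $(ELEPHANT_IMAGE_NAME)), and the image is assumed to expose `nvidia-smi` through the NVIDIA container runtime.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a container started with --gpus all
# can list at least one GPU.
# $1: image name (assumption: the value of ELEPHANT_IMAGE_NAME)
# DOCKER: docker command to use, defaults to "docker" (illustrative variable)
check_gpu() {
    image="$1"
    if ${DOCKER:-docker} run --rm --gpus all "$image" nvidia-smi -L; then
        echo "GPU visible"
    else
        echo "GPU not visible"
        return 1
    fi
}

# Usage (not executed here):
#   check_gpu "<image name>"
```

A non-zero return from the helper means the container could not be started with GPU access or `nvidia-smi` found no device, which corresponds to the failure this pull request is meant to address.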